Use direct mixture sampling in simulation #21
Merged
Closes #20
This PR optimizes sampling in the synthetic data generation function, making it at least 10x faster, so that X can be simulated at 100k x 100k scale easily.
Old implementation
In the previous (naive) implementation, each document is generated by a two-step hierarchical procedure, and the draws are then summed to get the document-term counts. This strictly follows the data model, but it requires a double for-loop and is slow.
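For illustration only, the two-step procedure might look like the sketch below. It assumes a topic-model-style setup; the names `theta` (per-document category weights), `beta` (per-category word distributions), `doc_lengths`, and `simulate_two_step` are hypothetical and are not taken from this repository.

```python
import numpy as np

def simulate_two_step(theta, beta, doc_lengths, rng):
    """Hypothetical sketch of hierarchical sampling.

    theta: (n_docs, n_topics), rows sum to 1
    beta:  (n_topics, n_words), rows sum to 1
    """
    n_docs, n_topics = theta.shape
    n_words = beta.shape[1]
    X = np.zeros((n_docs, n_words), dtype=np.int64)
    for i in range(n_docs):
        # Step 1: split the document's tokens across topics.
        topic_counts = rng.multinomial(doc_lengths[i], theta[i])
        for k in range(n_topics):
            # Step 2: draw words for the tokens assigned to topic k,
            # then accumulate them into the document-term counts.
            X[i] += rng.multinomial(topic_counts[k], beta[k])
    return X

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(3), size=5)   # 5 documents, 3 topics
beta = rng.dirichlet(np.ones(8), size=3)    # 3 topics, 8 words
doc_lengths = np.full(5, 100)
X_two_step = simulate_two_step(theta, beta, doc_lengths, rng)
```

The nested loop runs once per (document, topic) pair, which is what makes this approach slow at large scale.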
New implementation
In the new implementation, we combine the two steps into one by sampling directly from the mixture.
This leverages a property of the multinomial distribution: if $X \sim \text{Multinomial}(n, p)$ gives the number of draws falling in each category $k$, and the draws in category $k$ are distributed over outcomes as $Y_k \sim \text{Multinomial}(X_k, q_k)$, then the total $Y = \sum_k Y_k$ satisfies $Y \sim \text{Multinomial}(n, \sum_k p_k q_k)$. In other words, first choosing a category from a multinomial and then choosing an outcome within that category from another multinomial is equivalent to a single multinomial draw from the mixture distribution.
This reduces to a matrix multiplication followed by a multinomial draw, which is much faster. The most time-consuming dataset in #20 now takes < 5s to generate.
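A minimal sketch of direct mixture sampling, under the same kind of assumptions: `theta`, `beta`, `doc_lengths`, and `simulate_direct` are hypothetical names chosen for illustration, not the identifiers used in this PR.

```python
import numpy as np

def simulate_direct(theta, beta, doc_lengths, rng):
    """Hypothetical sketch: one multinomial draw per document.

    theta: (n_docs, n_topics), rows sum to 1
    beta:  (n_topics, n_words), rows sum to 1
    """
    # Mixture distribution over words for every document at once.
    mix = theta @ beta
    # Renormalize rows to guard against floating-point drift.
    mix = mix / mix.sum(axis=1, keepdims=True)
    # A single draw per document replaces the inner loop over topics.
    # (Recent NumPy releases can also broadcast this into one call.)
    return np.stack([rng.multinomial(n, p) for n, p in zip(doc_lengths, mix)])

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(3), size=5)   # 5 documents, 3 topics
beta = rng.dirichlet(np.ones(8), size=3)    # 3 topics, 8 words
doc_lengths = np.full(5, 100)
X_direct = simulate_direct(theta, beta, doc_lengths, rng)
```

Each row of `X_direct` sums to the corresponding document length, and the per-topic assignments are never materialized.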